Random Forest
Kristen Monaco, Praya Cheekapara, Raymond Fleming, Teng Ma
Random Forest Overview
Ensemble machine learning method based on a large number of decision trees voting to predict a classification
Benefits compared to decision tree:
Able to function with incomplete data
Lower likelihood of overfitting
Improved prediction accuracy
Bootstrap Sampling (Bagging)
Each decision tree uses a random sample of the original dataset
Using a subset of the dataset reduces the probability of an overfit model
Rows with missing data will often be left out of the sample, improving performance
Performed with replacement
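The sampling step above can be sketched in a few lines: each tree draws a sample of the same size as the original dataset, with replacement, so some rows repeat and others are left out. The function name below is illustrative, not from any particular library.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw a bootstrap sample: row indices chosen with replacement."""
    n = len(X)
    idx = rng.integers(0, n, size=n)  # same size as original, duplicates allowed
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
Xb, yb = bootstrap_sample(X, y, rng)
# The sample matches the original in size, but rows left out of it
# ("out-of-bag" rows) can later be used to estimate model error.
print(len(Xb))
```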
Random Feature Selection
A random set of features is selected for each node in training
Information about feature importance may be saved and applied in future iterations
Even with automated random feature selection, feature selection and engineering prior to training may improve performance
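A minimal sketch of the per-node feature subsetting described above: at each split, only a random subset of features is considered, commonly sqrt(p) of them for classification. The helper name is hypothetical.

```python
import numpy as np

def candidate_features(n_features, rng):
    """Pick a random sqrt(p)-sized subset of feature indices for one node."""
    k = max(1, int(np.sqrt(n_features)))
    return rng.choice(n_features, size=k, replace=False)

rng = np.random.default_rng(1)
feats = candidate_features(16, rng)  # 4 distinct indices out of 16
print(sorted(feats.tolist()))
```

Because each node sees a different random subset, individual trees become decorrelated, which is what makes the ensemble vote effective.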
Cross Validation
Validates the performance of the trained model
Resampling method similar to bootstrapping, but without replacement
Allows approximation of the general performance of a model
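The resampling-without-replacement idea above is typically realized as k-fold cross-validation: the rows are partitioned into k disjoint folds, and each fold serves once as the held-out validation set. A minimal sketch:

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Partition n row indices into k disjoint folds (no replacement)."""
    idx = rng.permutation(n)
    return np.array_split(idx, k)

rng = np.random.default_rng(0)
folds = kfold_indices(10, 5, rng)
for i, test_fold in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train on train_idx, score on test_fold; average the k scores
    # to approximate the model's general performance
```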
Prediction
Each trained decision tree produces its own prediction
Decision trees are independent, and were trained on different subsets of both data and features
Ensemble Voting
The results from each decision tree are combined into a voting classifier
The mode of the classification results will be the final prediction
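The voting step above reduces to taking the mode of the individual tree predictions. A minimal sketch, with illustrative labels:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common class among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

tree_votes = ["threatened", "safe", "threatened", "threatened", "safe"]
print(majority_vote(tree_votes))  # -> "threatened"
```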
Dataset
South African Red List
Data about plants with their habitat, traits, distribution, and factors influencing their current threatened/extinct status
Purpose
Predict whether or not an unknown plant is threatened based on the above characteristics
Visuals 1
Distribution Range
Visuals 2
Cramer’s V Association
Analysis
Five separate random forest models were created, each using a different normalization method
Data Preparation
Preprocessing
Encode categorical features into numerical / factor features
Split the dataset into training and test sets while avoiding class imbalance
Preprocessing
Class Imbalance
Resample smaller classes so that class sizes are approximately equal
Training on imbalanced datasets will bias predictions to the larger class
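One common way to implement the resampling above is to oversample each minority class with replacement until all classes match the largest one. A sketch with hypothetical names:

```python
import numpy as np

def oversample_minority(X, y, rng):
    """Resample every class (with replacement) up to the largest class size."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [], []
    for c in classes:
        idx = np.flatnonzero(y == c)
        resampled = rng.choice(idx, size=n_max, replace=True)
        X_parts.append(X[resampled])
        y_parts.append(y[resampled])
    return np.concatenate(X_parts), np.concatenate(y_parts)

rng = np.random.default_rng(0)
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])       # imbalanced: 4 vs 2
Xb, yb = oversample_minority(X, y, rng)
print(np.bincount(yb))                  # -> [4 4]
```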
Normalization
Apply 5 normalization techniques to both training and test datasets
Min-Max
Z-Score
Max Absolute Value
L1 Norm
L2 Norm
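The five techniques listed above can be sketched directly in numpy. Min-max, z-score, and max-abs scale each feature column; the L1 and L2 norms scale each row. In practice the statistics (min, max, mean, std) are fit on the training set only and then reused on the test set to avoid leakage.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # -> [0, 1]
z_score = (X - X.mean(axis=0)) / X.std(axis=0)                   # mean 0, std 1
max_abs = X / np.abs(X).max(axis=0)                              # -> [-1, 1]
l1_norm = X / np.abs(X).sum(axis=1, keepdims=True)               # rows sum to 1
l2_norm = X / np.linalg.norm(X, axis=1, keepdims=True)           # unit-length rows

print(min_max[:, 0])  # -> [0.  0.5 1. ]
```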
Prediction
Combine results into a vector
Identify the most frequently predicted class
Iterate over entire test set, storing results
Generate a confusion matrix, then calculate sensitivity and precision for each category
Iterate after tuning if necessary
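The evaluation steps above can be sketched as follows: build a confusion matrix from the true and predicted labels, then read per-class sensitivity (recall) off its rows and precision off its columns. The labels here are made-up examples.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
cm = confusion_matrix(y_true, y_pred, 2)

sensitivity = np.diag(cm) / cm.sum(axis=1)  # TP / (TP + FN), per class
precision = np.diag(cm) / cm.sum(axis=0)    # TP / (TP + FP), per class
print(cm)
```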
Results
Range was found to be the strongest predictor of extinction
Habitat loss is the second strongest predictor of extinction